Load required packages:
library(RColorBrewer) ## nicer color schemes
library(xcms) ## the package doing the job
library(tidyverse) ## potentially useful
library(knitr)
library(plotly)
library(FactoMineR)
library(factoextra)
library(effectsize)
We will analyze a subset of the data from UPLC-QTof MS untargeted analysis of Vitis vinifera L. leaves, collected in Italy and Germany from two fungus-resistant grape varieties: Regent and Phoenix. For this tutorial, we will focus on the quest of metabolic “biomarkers” for the geographical origin within Phoenix samples,
The raw data files (both in mzML and CDF format) are available from the MetaboLights repository. To speed up processing of this exercise we will restrict the analysis to the following subset of study files:
QC:
L011_QC_RP_pos01.CDF
L003_QC_RP_pos01.CDF
L018_QC_RP_pos01.CDF
Italy Phoenix:
L015_010_LPIi_RP_pos01.CDF
L006_002_LPIh_RP_pos01.CDF
L026_019_LPIc_RP_pos01.CDF
Germany Phoenix:
L023_017_LPGc_RP_pos01.CDF
L037_029_LPGe_RP_pos01.CDF
L047_037_LPGa_RP_pos01.CDF
You have to download them (in .CDF format) and save them on your PC in the same directory as this .Rmd file. Each file contains data in centroid mode acquired in positive ESI mode over a mass range from 50 to 2000 Da by using a chromatography of 28 minutes.
Note A large part of this tutorial is taken from the official vignette of xcms. Many thanks to Steffen Neumann and Johannes Rainer!
We start getting the raw data into R:
## Get the full path to the CDF files
cdfs <- list.files("../data/cdfs_grape/", full.names = TRUE)
cdfs <- cdfs[grepl("CDF", cdfs)]
cdfs
## [1] "../data/cdfs_grape//L003_QC_RP_pos01.CDF" "../data/cdfs_grape//L006_002_LPIh_RP_pos01.CDF"
## [3] "../data/cdfs_grape//L011_QC_RP_pos01.CDF" "../data/cdfs_grape//L015_010_LPIi_RP_pos01.CDF"
## [5] "../data/cdfs_grape//L018_QC_RP_pos01.CDF" "../data/cdfs_grape//L023_017_LPGc_RP_pos01.CDF"
## [7] "../data/cdfs_grape//L026_019_LPIc_RP_pos01.CDF" "../data/cdfs_grape//L037_029_LPGe_RP_pos01.CDF"
## [9] "../data/cdfs_grape//L047_037_LPGa_RP_pos01.CDF"
In the last few years the xcms developers have been making a big effort to make their package coherent with a general framework for the handling of MS data in R (metabolomics, proteomics …). All this goes beyond the scope of our course, for us is sufficient to know that this infrastructure allows to store sample “metadata” (e.g. treatment class, time point, etc) together with the raw experimental data.
In our specific case, the data frame with the phenotipic data could be designed as follow:
phenodata <- tibble(matrix(ncol = 5, nrow = length(cdfs)))
colnames(phenodata) <- c("filename", "type", "country", "variety", "class")
phenodata$filename <- cdfs
phenodata$type <- "leaves"
phenodata$type[grep("QC", cdfs)] <- "QC"
phenodata$country <- "QC"
phenodata$country[grep("I", cdfs)] <- "italy"
phenodata$country[grep("G", cdfs)] <- "germany"
phenodata$variety <- "QC"
phenodata$variety[grep("LP", cdfs)] <- "phoenix"
phenodata$class <- paste(phenodata$country, phenodata$variety, sep = "_")
phenodata$class <- gsub("QC_QC", "QC", phenodata$class)
phenodata
## # A tibble: 9 × 5
## filename type country variety class
## <chr> <chr> <chr> <chr> <chr>
## 1 ../data/cdfs_grape//L003_QC_RP_pos01.CDF QC QC QC QC
## 2 ../data/cdfs_grape//L006_002_LPIh_RP_pos01.CDF leaves italy phoenix italy_phoenix
## 3 ../data/cdfs_grape//L011_QC_RP_pos01.CDF QC QC QC QC
## 4 ../data/cdfs_grape//L015_010_LPIi_RP_pos01.CDF leaves italy phoenix italy_phoenix
## 5 ../data/cdfs_grape//L018_QC_RP_pos01.CDF QC QC QC QC
## 6 ../data/cdfs_grape//L023_017_LPGc_RP_pos01.CDF leaves germany phoenix germany_phoenix
## 7 ../data/cdfs_grape//L026_019_LPIc_RP_pos01.CDF leaves italy phoenix italy_phoenix
## 8 ../data/cdfs_grape//L037_029_LPGe_RP_pos01.CDF leaves germany phoenix germany_phoenix
## 9 ../data/cdfs_grape//L047_037_LPGa_RP_pos01.CDF leaves germany phoenix germany_phoenix
Below we restrict the phenodata matrix to our samples of interest (in this case, for the tutorial part QC samples and all samples from the variety Phoenix):
Up to now nothing has been actually loaded into R. To do that:
## As you can see the data frame with the phenotypic data is included inside the object holding the raw data
raw_data <- readMSData(
files = phenodata$filename,
pdata = new("NAnnotatedDataFrame",
phenodata), ## this is the structure of xcms holding phenotypic data
mode = "onDisk") ## with this parameter the data are not loaded into RAM
Loading the full dataset into RAM can be problematic for large studies so with this specific on disk mode the raw data are still staying on the disk.
We next restrict the data set to the retention time range from 800 to 1000 seconds and to the mass-to-charge ratio range from 50 to 1000, just to save some processing time …
## These two xcms functions are used to subset the raw data
raw_data <- filterRt(raw_data, c(800, 1000))
raw_data <- filterMz(raw_data, c(50, 1000))
It is important that along the process one can be able to visualize the raw data, so let’s give a look to the structure of the R object we created.
The raw_data object contains the full set of 3D data collected in all our samples. The “raw” values can be extracted by using:
rt <- rtime(raw_data)
mz <- mz(raw_data)
I <- intensity(raw_data)
Let’s look to the structure of these three objects:
glimpse(rt)
## Named num [1:2512] 800 802 803 803 804 ...
## - attr(*, "names")= chr [1:2512] "F1.S1120" "F1.S1121" "F1.S1122" "F1.S1123" ...
These are the retention times in seconds for the chromatography of all loaded files. For example, F1.S1120 stands for File1, scan number 1120 and it was recorded at 800 seconds.
Another way to see that:
plot(rt)
## The full number of scans for the set of 9 files
length(rt)
## [1] 2512
Here the individual lines highlight the increasing time scale for each file.
mz and I holds the mass spectra collected at each scantime … for this reason the two objects are lists and not vectors. Remember our data are 3D. For each scantime we have a complete mass spectrum.
## only the first 20
glimpse(mz[1:10])
## List of 10
## $ F1.S1120: num [1:3416] 54.7 54.9 55 55.1 59.3 ...
## $ F1.S1121: num [1:3366] 50.7 52.1 53.1 53.8 55 ...
## $ F1.S1122: num [1:3333] 51.9 53.5 53.8 55 55.1 ...
## $ F1.S1123: num [1:3364] 50.5 54.8 55 60 63.7 ...
## $ F1.S1124: num [1:3387] 50.5 53 53.9 55 56.9 ...
## $ F1.S1125: num [1:3404] 50 50.7 53.4 56.9 57.7 ...
## $ F1.S1126: num [1:3424] 60.2 61.5 62.4 63.5 63.7 ...
## $ F1.S1127: num [1:3356] 50.2 50.5 52.7 55.1 56 ...
## $ F1.S1128: num [1:3368] 50.5 54.4 55 55 56.9 ...
## $ F1.S1129: num [1:3483] 51.3 54.5 55.1 55.5 55.7 ...
## only the first 20
glimpse(I[1:10])
## List of 10
## $ F1.S1120: num [1:3416] 1.01 1.01 1.01 1.01 1.01 ...
## $ F1.S1121: num [1:3366] 1.01 1.01 1.01 1.01 1.01 ...
## $ F1.S1122: num [1:3333] 1.01 1.01 1.01 1.01 1.01 ...
## $ F1.S1123: num [1:3364] 1.01 1.01 2.03 1.01 1.01 ...
## $ F1.S1124: num [1:3387] 1.01 1.01 1.01 1.01 1.01 ...
## $ F1.S1125: num [1:3404] 1.01 1.01 1.01 1.01 1.01 ...
## $ F1.S1126: num [1:3424] 1.01 1.01 1.01 1.01 1.01 ...
## $ F1.S1127: num [1:3356] 1.01 1.01 1.01 1.01 1.01 ...
## $ F1.S1128: num [1:3368] 1.01 1.01 1.01 1.01 1.01 ...
## $ F1.S1129: num [1:3483] 1.01 1.01 1.01 1.01 1.01 ...
We can plot a complete spectrum (here the first scan of the first sample …):
plot(mz[[1]], I[[1]], type = "h",
main = names(mz)[1], xlab = "m/z", ylab = "intensity")
xcms provides tools to visualize and play with the raw data, in a more direct way:
## the size of our raw data
length(raw_data)
## [1] 2512
is exactly the number of scans, and:
raw_data[[1]]
## Object of class "Spectrum1"
## Retention time: 13:20
## MSn level: 1
## Total ion count: 3416
## Polarity: -1
is indeed a spectrum, which has a plot method:
plot(raw_data[[1]])
This is a sort of gg stuff so if we want an interactive stuff we can rely on the plotly package, and also change some of the characteristics of the graphical layout:
ggplotly((plot(raw_data[[1]])))
Ok, working with all files together is not the best … for visualization and handling. To facilitate the “cutting” by file, xcms is provided with a split() function which can be combined with a fromFile function to create a list with the content separate by file:
single_raw <- split(raw_data, fromFile(raw_data))
and each element of the list is now a single raw data:
single_raw[[1]]
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.13 Mb
## - - - Spectra data - - -
## MS level(s): 1
## Number of spectra: 279
## MSn retention times: 13:20 - 16:39 minutes
## - - - Processing information - - -
## Data loaded [Thu Nov 4 11:12:08 2021]
## Filter: select retention time [800-1000] and MS level(s), 1 [Thu Nov 4 11:12:08 2021]
## Filter: trim MZ [50..1000] on MS level(s) 1.
## MSnbase version: 2.18.0
## - - - Lazy processing queue content - - -
## o filterMz
## - - - Meta data - - -
## phenoData
## rowNames: 1
## varLabels: filename type ... class (5 total)
## varMetadata: labelDescription
## Loaded from:
## L003_QC_RP_pos01.CDF
## protocolData: none
## featureData
## featureNames: F1.S1120 F1.S1121 ... F1.S1398 (279 total)
## fvarLabels: fileIdx spIdx ... spectrum (35 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
As we discussed, metabolites are visible as peaks in 3D mz/rt/intensity plane …
mytibble <- tibble(rt = rtime(single_raw[[1]]),
mz = mz(single_raw[[1]]),
I = intensity(single_raw[[1]]))
mytibble
## # A tibble: 279 × 3
## rt mz I
## <dbl> <named list> <named list>
## 1 800. <dbl [3,416]> <dbl [3,416]>
## 2 802. <dbl [3,366]> <dbl [3,366]>
## 3 803. <dbl [3,333]> <dbl [3,333]>
## 4 803. <dbl [3,364]> <dbl [3,364]>
## 5 804. <dbl [3,387]> <dbl [3,387]>
## 6 805. <dbl [3,404]> <dbl [3,404]>
## 7 805. <dbl [3,424]> <dbl [3,424]>
## 8 806. <dbl [3,356]> <dbl [3,356]>
## 9 807. <dbl [3,368]> <dbl [3,368]>
## 10 807. <dbl [3,483]> <dbl [3,483]>
## # … with 269 more rows
And now a fancy plot ….
jet.colors <- colorRampPalette(
c("#00007F", "blue", "#007FFF", "cyan", "#7FFF7F",
"yellow", "#FF7F00", "red", "#7F0000"))
mapplot <- mytibble %>%
unnest(c("mz","I")) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = sqrt(I)), size = 0.5) +
scale_color_gradientn(colours = jet.colors(7)) +
theme_light()
mapplot +
xlim(920, 970) +
ylim(412, 420)
The second thing we would like to visualize is one extracted ion trace:
## Define the rt and m/z range of the peak area
rtr <- 15.6*60
mzr <- 413.1207
## extract the chromatogram
chr_raw <- chromatogram(raw_data,
mz = mzr + 0.1*c(-1, 1),
rt = rtr + 30*c(-1, 1))
plot(chr_raw)
So we are able to actually see the chromatographic peak of the m/z 413.
When one is dealing with the initial investigation of the data the first thing to do is to look to the total ion current of each chromatogram or to the base peak ion chromatogram:
## Get the total ion chromatograms. This reads data from all files.
tics <- chromatogram(raw_data, aggregationFun = "sum")
## Define colors for the two groups:
group_colors <- paste0(brewer.pal(3, "Set1"), "60")
names(group_colors) <- c("QC", "italy", "germany")
## Plot all chromatograms:
plot(tics, col = group_colors[raw_data$country])
## raw_data$country extracts the info on the phenotipic data inside the raw_data
As you can see, the results are different with a “common” look and feel. The large majority of peaks is there, even if the intensities are not always comparable. Red traces (QCs) are highly reproducible, speaking of an overall good reproducibility of the analytical pipeline.
The overall integral of the signal in each sample is often used as a way to spot gross analytical drifts
## here we rely on the old (and efficient) R style
total_I <- sapply(tics, function(x) sum(intensity(x)))
plot(total_I, col = gsub("60", "", group_colors[raw_data$country]), pch = 19, cex = 2)
We clearly see the overall difference between blue and green samples. This already indicate that the two classes of samples are strongly different. This is not really important at this point, but one should keep this in mind when we will perform the statistical analysis. In general, an important thing to look at here is the TIC of QCs: if the analytical performance of the sistem is optimal, they should be comparable.
A second and often less general quality check of the experimental data rely on the inspection of the chromatographic trace of one (or more) known compounds. In the case at hand, we know that a specific metabolite present in my samples yields an ion @mz413… let’s look to the profile of this ion signal over the chromatographic time:
## here we get the traces ... compare the function with the one used for the TICs
ion_I_know <- chromatogram(raw_data, mz = mzr + 0.1*c(-1, 1))
plot(ion_I_know, col = group_colors[raw_data$country])
The previous plot is important: it is telling us that the metabolite_I_know is present in the samples and is released by the chromatographic column at around 935 sec … There it is producing a peak in the signal of the ion_I_know @mz413
To automatically find metabolites in my data I have to teach to the computer to look for peaks in the chromatographic traces of all possible ions.
The “older” and most sounding way of finding peaks implemented in xcms is the matched filter algorithm.
A full description of the parameters of the algorithm can be found in the xcms manual, here we focus on:
In xcms the parameters of the algorithm are stored into a specific object:
mf <- MatchedFilterParam(binSize = 0.1,
fwhm = 6,
snthresh = 6)
mf
## Object of class: MatchedFilterParam
## Parameters:
## - binSize: [1] 0.1
## - impute: [1] "none"
## - baseValue: numeric(0)
## - distance: numeric(0)
## - fwhm: [1] 6
## - sigma: [1] 2.547987
## - max: [1] 5
## - snthresh: [1] 6
## - steps: [1] 2
## - mzdiff: [1] 0.6
## - index: [1] FALSE
Now I can use the previous parameters to find the peaks in one sample:
first_peaks <- findChromPeaks(single_raw[[1]], param = mf)
The actual list of peaks can be extracted from the previous object by the method chromPeaks.
Let’s look to the head of the output:
first_peak_table <- chromPeaks(first_peaks)
dim(first_peak_table)
## [1] 383 13
head(first_peak_table, 5)
## mz mzmin mzmax rt rtmin rtmax into intf maxo maxf i sn sample
## CP001 83.04975 83.04828 83.05143 935.382 932.817 939.117 100.61855 86.77055 20.15974 21.96211 1 6.129351 1
## CP002 85.02940 85.02786 85.03004 813.489 809.747 816.054 283.34467 238.70072 62.62814 56.30592 1 10.705693 1
## CP003 93.07115 93.07047 93.07191 996.838 991.832 999.395 330.82682 326.61803 54.80862 62.14951 1 9.224595 1
## CP004 99.08113 99.07874 99.08565 996.838 993.104 999.395 72.73965 89.41898 18.42368 20.41894 1 9.315135 1
## CP005 105.03462 105.02673 105.03739 874.465 871.899 877.025 35.55708 37.69402 10.12658 11.77455 1 7.586727 1
The first two numbers are telling us that with the setting we have been choosing we were able to find 383 peaks.
The help of xcms describes the most relevant columns of the table:
“mz” (intensity-weighted mean of mz values of the peak across scans/retention times), “mzmin” (minimal mz value), “mzmax” (maximal mz value), “rt” (retention time of the peak apex), “rtmin” (minimal retention time), “rtmax” (maximal retention time), “into” (integrated, original, intensity of the peak), “maxo” (maximum intensity of the peak), “sample” (sample index in which the peak was identified)
This is the map of the identified peaks
as_tibble(first_peak_table) %>%
ggplot() +
geom_point(aes(x = rt, y = mz), size = 2, alpha =1, pch = 21, fill = "orange") +
theme_light()
Peak picking can also be performed with another algorithm: CentWave
cwp <- CentWaveParam(peakwidth = c(5, 30),
ppm = 30,
prefilter = c(3, 100))
cwp
## Object of class: CentWaveParam
## Parameters:
## - ppm: [1] 30
## - peakwidth: [1] 5 30
## - snthresh: [1] 10
## - prefilter: [1] 3 100
## - mzCenterFun: [1] "wMean"
## - integrate: [1] 1
## - mzdiff: [1] -0.001
## - fitgauss: [1] FALSE
## - noise: [1] 0
## - verboseColumns: [1] FALSE
## - roiList: list()
## - firstBaselineCheck: [1] TRUE
## - roiScales: numeric(0)
## - extendLengthMSW: [1] FALSE
Also here many parameters (and others are not mentioned). I highlight here some of them:
If we run the peak picking with this new algorithm…
first_peaks_cw <- findChromPeaks(single_raw[[1]], param = cwp)
first_peak_table_cw <- chromPeaks(first_peaks_cw)
dim(first_peak_table_cw)
## [1] 152 11
head(first_peak_table_cw, 5)
## mz mzmin mzmax rt rtmin rtmax into intb maxo sn sample
## CP001 243.0677 243.0672 243.0690 802.710 800.254 807.822 711.4057 686.0903 164.0446 43 1
## CP002 139.0407 139.0399 139.0412 802.710 800.254 813.489 1817.9908 1647.9536 218.6978 17 1
## CP003 380.1419 380.1408 380.1432 802.710 800.254 805.907 1272.7966 1266.3360 307.1760 146 1
## CP004 380.1417 380.1408 380.1431 807.187 805.907 813.489 775.9329 767.5927 168.5718 81 1
## CP005 169.0518 169.0503 169.0533 803.349 800.254 810.387 1035.3783 1020.4388 133.5483 71 1
As you see the number of columns is different, but the key infos are there.
as_tibble(first_peak_table_cw) %>%
ggplot() +
geom_point(aes(x = rt, y = mz), size = 2, alpha =1, pch = 21, fill = "orange") +
theme_light()
If we superimpose them…
t_mf <- as_tibble(first_peak_table)
as_tibble(first_peak_table_cw) %>%
ggplot() +
geom_point(aes(x = rt, y = mz), size = 2, alpha =1, pch = 21, fill = "orange") +
geom_point(data = t_mf, mapping = aes(x = rt, y = mz), size = 2, alpha =0.5, col = "steelblue") +
theme_light()
The difference is striking!
Obviously one could fiddle around with the parameters to look for a more coherent picture, but the difference is not unexpected considering the fact that we are dealing with two different approaches.
When we are satisfied with a set of peak picking parameters, the algorithm will be sequentially run on all the files of the dataset resulting in a large list of peaks assigned to the different samples.
xdata <- findChromPeaks(raw_data, param = cwp)
Here a table of the peaks found in all files:
table(chromPeaks(xdata)[, "sample"])
##
## 1 2 3 4 5 6 7 8 9
## 152 164 152 159 149 113 190 108 86
An overall representation of their distribution in the plane is extremely interesting:
chromPeaks(xdata) %>%
as_tibble() %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = into), size = 0.3) +
facet_wrap(~sample) +
theme_light()
As you can see the samples are different, but the overall “look and feel” is coherent. This is telling us that the overall analytical run was good.
Regarding the retention time shifts…
chromPeaks(xdata) %>%
as_tibble() %>%
filter(sample %in% c(1, 8)) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = factor(sample)), siaze = 0.3) +
scale_color_brewer(palette = "Set1") +
theme_light()
From the plot we can see a small but visible shift in RT. The shift is responsible of a difference in the samples coming from the analysis and not the biology and it has to be corrected to avoid biased results.
xcms can do much more to browse and characterize the peaks, but here we want to focus on the key ideas.
In summary:
The alignment step, also referred to as retention time correction, aims at adjusting this by shifting signals along the retention time axis to align the signals between different samples within an experiment.
Also here a plethora of approaches is available. As usual, everything will work better if the chormatography is more reproducible (for GC, for example, retention time correction is often not necessary).
In xcms the most used and reliable method for alignment of high resolution experiments is based on the obiwarp approach. The algorithm was developed for proteomics and is based on dynamic time warping.
The alignment is performed directly on the profile-matrix and can hence be performed independently of the peak detection or peak grouping.
If the samples look quiet different among them, it might be helpful to perform the alignment based on only QC samples (or another subset of samples) and use these to adjust the full data set.
xdata <- adjustRtime(xdata, param = ObiwarpParam(
binSize = 0.2,
subset = which(xdata$type == "QC"),
subsetAdjust = "average"))
It is of utmost importance to check the amount of correction since large time shifts are not reasonable. Below we plot the BPC before and after applying the RT correction, as well as the differences of the adjusted- to the raw retention times per sample.
chr_raw <- chromatogram(xdata, aggregationFun = "max",
adjustedRtime = FALSE)
chr_adj <- chromatogram(xdata, aggregationFun = "max")
par(mfrow = c(3, 1), mar = c(2, 4.3, 2, 0.5))
plot(chr_raw, peakType = "none", main = "BPC, raw",
col = group_colors[xdata$country])
plot(chr_adj, peakType = "none", main = "BPC, adjusted",
col = group_colors[xdata$country])
plotAdjustedRtime(xdata, col = group_colors[xdata$country])
As you can see the correction is never bigger than 2 seconds. With a chromatographic peak width of around 10 seconds this is more than acceptable and, another time it speaks of a overall good analytical reproducibility.
xdata now still contains the list of the peaks for the different samples, but now they retention time should be less erratic…
chromPeaks(xdata) %>%
as_tibble() %>%
filter(sample %in% c(1,8)) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = factor(sample)), siaze = 0.3) +
scale_color_brewer(palette = "Set1") +
theme_light()
As you can see the situation has improved and some of the vertical stripes are now well aligned.
The last step is to find a consensus list of variables across the different samples. These will be the features which will be used for the data analysis. The list of peaks is now aligned in retention time, but:
The common way of doing this step in xcms relies in a density based approach.
The algorithm combines chromatographic peaks depending on the density of peaks along the retention time axis (all peaks found in all samples together!) within small slices along the m/z dimension.
Care should be taken to account for the fact that a peak could be absent in a sample or in a set of samples and to avoid, in the meantime, to keep peaks found only in one sample.
As before, the parameters of this step are included in a specific object
pdp <- PeakDensityParam(sampleGroups = xdata$country,
minFraction = 0.5,
bw = 30,
binSize = 0.1)
A set of peaks will be considered a candidate to become a “valid” group if it contains peaks coming from at least a minFraction of samples belonging to one of the sampleGroups.
An example will make this more clear. Suppose I have a dataset with two sample groups: one of 4 samples, the other of 6. If I set minFraction to 0.5, a group of peaks will be considered a feature if it contains at least:
… or more.
Grouping is finally performed with:
xdata <- groupChromPeaks(xdata, param = pdp)
The features are now the variables which will show-up in the data matrix. Their definition has been added by the groupChromPeaks method to the xdata object (which also contains the definition of the peaks of the different samples).
The definition can be extracted as a dataframe:
myfeatures <- featureDefinitions(xdata)
head(myfeatures, 5)
## DataFrame with 5 rows and 12 columns
## mzmed mzmin mzmax rtmed rtmin rtmax npeaks germany italy QC peakidx ms_level
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <list> <integer>
## FT001 95.0865 95.0862 95.087 849.097 848.471 850.046 3 0 3 0 180,511,926 1
## FT002 107.0871 107.0869 107.087 996.557 996.238 996.876 2 0 2 0 289,1050 1
## FT003 119.0872 119.0866 119.088 849.737 848.471 850.402 5 0 3 2 38,189,500,... 1
## FT004 121.0664 121.0663 121.066 996.877 996.876 996.878 2 0 2 0 290,1051 1
## FT005 121.1025 121.1021 121.103 996.876 996.238 996.876 4 2 2 0 291, 871,1052,... 1
The table contains:
mzmed,mzmin,mzmax,rtmed,rtmin,rtmax)npeaksThe (almost) final untargeted data matrix can be extracted from the same object with:
DM <- featureValues(xdata, value = "into")
dim(DM)
## [1] 135 9
The intensity used to build the data matrix is normally chosen as:
into: integrated, original, intensity of the peakmaxo: maximum intensity of the peakIn our simple example we have 135 variables measured over 9 samples:
head(DM)
## L003_QC_RP_pos01.CDF L006_002_LPIh_RP_pos01.CDF L011_QC_RP_pos01.CDF L015_010_LPIi_RP_pos01.CDF L018_QC_RP_pos01.CDF
## FT001 NA 874.6886 NA 1221.790 NA
## FT002 NA 949.0719 NA NA NA
## FT003 845.8922 1398.7975 NA 1202.988 828.6412
## FT004 NA 730.9143 NA NA NA
## FT005 NA 1468.2392 NA NA NA
## FT006 NA 1261.4381 NA 1393.185 NA
## L023_017_LPGc_RP_pos01.CDF L026_019_LPIc_RP_pos01.CDF L037_029_LPGe_RP_pos01.CDF L047_037_LPGa_RP_pos01.CDF
## FT001 NA 973.9651 NA NA
## FT002 NA 1061.6101 NA NA
## FT003 NA 1283.7827 NA NA
## FT004 NA 744.8442 NA NA
## FT005 651.2681 1516.9884 706.7103 NA
## FT006 NA 1452.1044 NA NA
Note that DM holds samples in columns and variables in rows, so it should be transposed to be ready for the standard analysis.
So we finally get there, we have our data matrix full of intensities, but (as usual) missing values are not absent…
Another time:
To go on we have to try to fill at least a part of the holes with a reasonable number. The function fillChromPeaks() integrates the signal found in the mz-rt region of the feature. The mz and rt ranges used by the algorithm are defined as the lower quartile of the mzmin and rtmin values of all detected peaks of the feature, and the upper quartile of the mzmax and rtmax values.
This smart approach (which works well in many cases, even if there are exceptions) is implemented in xcms with the fillChromPeaks function:
xdata_filled <- fillChromPeaks(xdata)
Now our filled data matrix looks like this…
DM_f <- featureValues(xdata_filled, value = "into")
head(DM_f)
## L003_QC_RP_pos01.CDF L006_002_LPIh_RP_pos01.CDF L011_QC_RP_pos01.CDF L015_010_LPIi_RP_pos01.CDF L018_QC_RP_pos01.CDF
## FT001 364.3755 874.6886 437.0850 1221.7900 407.8034
## FT002 311.3094 949.0719 416.2391 579.1548 343.0307
## FT003 845.8922 1398.7975 428.3805 1202.9881 828.6412
## FT004 198.9592 730.9143 320.2578 355.7047 285.6761
## FT005 852.2550 1468.2392 688.0599 1259.5293 780.4506
## FT006 561.6849 1261.4381 360.7264 1393.1849 531.4599
## L023_017_LPGc_RP_pos01.CDF L026_019_LPIc_RP_pos01.CDF L037_029_LPGe_RP_pos01.CDF L047_037_LPGa_RP_pos01.CDF
## FT001 163.3833 973.9651 240.3923 83.61931
## FT002 212.3647 1061.6101 106.3215 24.90527
## FT003 175.6653 1283.7827 265.0664 130.04801
## FT004 171.3388 744.8442 175.6182 95.46605
## FT005 651.2681 1516.9884 706.7103 404.26093
## FT006 206.0327 1452.1044 221.3511 111.78935
A clear improvement, isn’t it ? :-)
This data matrix will be the starting point of our statistical analysis…
The subsequent step is to take a look at the multivariate structure of the data to spot larger scale patterns.
Remember that at the beginning of the tutorial we have filtered the data to the RT range 800-1000 seconds to be faster, but for this part of the session we are interested in keeping the information from all chromatograms. For that, we already have processed this data and now we are just going to upload it in the workspace.
In order to be able to use the xdata stored in the file, the raw CDF should be organized as follows:
–Folder_name | – data | |– cdfs_grape | |– phoenix.Rdata, *.CDF – Grape_analysis |– analyze_the_grape.Rmd
If you put the raw data and the RData in a different folder you have to update the xdata@processingData@files putting thre the right path pointing to the correct files
To check the actual path to the raw data stored in the xdata you can use
fileNames(xdata)
## [1] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L003_QC_RP_pos01.CDF"
## [2] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L006_002_LPIh_RP_pos01.CDF"
## [3] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L011_QC_RP_pos01.CDF"
## [4] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L015_010_LPIi_RP_pos01.CDF"
## [5] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L018_QC_RP_pos01.CDF"
## [6] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L023_017_LPGc_RP_pos01.CDF"
## [7] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L026_019_LPIc_RP_pos01.CDF"
## [8] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L037_029_LPGe_RP_pos01.CDF"
## [9] "/home/franceschp/Documents/Training/Metabolomics_lectures/data/cdfs_grape/L047_037_LPGa_RP_pos01.CDF"
load("phoenix.RData")
DM_f <- featureValues(xdata, value = "into")
In order to do a first PCA with the scope of getting a general overview of the data it is necessary to impute the data matrix, replacing NAs with meaningful numbers (it is important to keep in mind that even having applied the fillChromPeaks() function, there may still be some missing values in the data matrix):
My choice here is to work variable wise and replace the NAs with a random number drawn from an uniform distribution spanning from 0 to half of the minimum value measured for that variable
The rationale behind this choice is that:
First of all I need a function to perform the imputation of a vector
myimputer <- function(v){
if (sum(is.na(v)) == 0) {
return(v)
} else {
napos <- which(is.na(v))
newval <- runif(length(napos),0, min(v, na.rm = TRUE)/2)
out <- v
out[napos] <- newval
return(out)
}
}
Now we apply it to the full set of columns
DM_f <- t(DM_f)
set.seed(123)
DM_i <- apply(DM_f, 2, myimputer)
Below we apply log-transformation since usually the distribution of metabolomics data intensities is far from being normal. Additionally, in this case example, the number of samples for each group is extremely low.
DM_i <- log10(DM_i)
And now PCA!
myPCA <- PCA(DM_i, graph = FALSE)
fviz_pca_ind(myPCA,
habillage = factor(xdata$country),
geom = "point",
pointsize = 2,
axes = c(1,2)) +
scale_color_brewer(palette = "Set1")
The amount of variability captured in this representation is already the 70%, and the plot shows the presence of a clear separation between sample classes.
What about the variables?
fviz_pca_biplot(myPCA,
habillage = factor(xdata$country),
geom = "point",
pointsize = 2,
axes = c(1,2)) +
scale_color_brewer(palette = "Set1")
The biplot here seems to suggest that Italian samples probably show higher levels for the large majority of signals. Even if the biplot here is not really useful to spot clera patterns …
What we did so far is called exploratory data analysis. To write the paper we should also find the variables which are showing a significant effect of the design factors.
Notes
We will now employ the parametric test (the Student's t-Test) using in this case the log-transformed values.
class <- xdata$country[xdata$type != "QC"]
stats.tt <- function(x){t.test(x ~ class)$p.value}
pval_tt <- apply(DM_i[xdata$class != "QC",], 2, stats.tt)
hist(pval_tt, breaks = 20, xlim = c(0, 1), col = rgb(0, 0, 1, 1/4))
If there would be no difference the distribution of p-values is expected to be uniform … there is a clear enrichment of small p-values and it speaks of the presence of a significant fraction of differences between samples from Italy and Germany. This result is not unexpected considering the large difference observed in the PCA
random_DM <- DM_i[xdata$class != "QC",]
random_DM <- random_DM[sample(1:6),]
pval_tt_rnd <- apply(random_DM, 2, stats.tt)
hist(pval_tt, breaks = 20, xlim = c(0, 1), col=rgb(0, 0, 1, 1/4),
main = "Correctight and random labels")
hist(pval_tt_rnd, breaks = 20, xlim = c(0, 1), col=rgb(1, 0, 0, 1/4), add = T)
so far we know that the two classes are different, but to start with the biomarker annotation job we would need to decide a criterion to prioritize a subset of the features …
A sensible choice would be to rank the variables on the base of their contrast between the two classes… so we would be interested on focusing on those features which, in addition to being statistically significant, show high differences between their respective mean values. The presence of an “high” contrast between the two classess can be measured in terms of effect size. Specifically we’ll use the Cohen's d.
stats.cd <- function(x){cohens_d(x ~ class)$Cohens_d}
cd <- apply(DM_i[xdata$class != "QC",], 2, stats.cd)
The Cohens’d and the p-value can be represented in a volcano plot
plot(cd,1-pval_tt, xlab = "Cohen's D", main = "Volcano Plot", pch = 3)
The crosses in the upper extremes corresponds to features with high statistical significance and large effect size
Now we join all the outputs in a unique data matrix and we sort the results by effect size:
res <- cbind(
as.data.frame(featureDefinitions(xdata))[,c("mzmed", "rtmed")], pval_tt, cd)
res$rtmed <- res$rtmed/60
res <- res[order(res[,"cd"]),]
head(res)
## mzmed rtmed pval_tt cd
## FT182 225.1504 13.518290 2.345346e-06 -57.67963
## FT581 443.1593 6.582005 1.123973e-05 -37.97001
## FT021 119.0869 20.403369 1.478472e-05 -21.71849
## FT337 324.1072 9.109826 5.815651e-04 -21.68161
## FT016 114.0560 1.464615 7.269864e-04 -21.04946
## FT431 369.1030 8.225872 2.196857e-04 -18.27667
tail(res)
## mzmed rtmed pval_tt cd
## FT363 337.0194 17.15362 3.565600e-04 11.23554
## FT305 319.0102 16.36280 1.246203e-03 12.46449
## FT392 349.1273 10.35626 1.022902e-03 12.50758
## FT726 596.1504 20.24369 1.602160e-04 15.47363
## FT427 365.1217 15.39883 5.004988e-05 15.97732
## FT282 309.0423 10.84316 1.248875e-05 34.06363
This is the list of “biomarkers”, let’s plot one of them …
sample_colors <- group_colors[xdata$country]
set.seed(123)
ptx <- as.numeric(factor(xdata$class)) + runif(length(xdata$class), -0.2, 0.2)
plot(ptx, DM_f[,rownames(res)[1]], ylab = "intensity",
col = gsub("60", "", sample_colors), pch = 19, xaxt = "n",
ylim = c(0, max(DM_f[,rownames(res)[1]])),
main = rownames(res)[1])
axis(1, seq(3), levels(factor(xdata$class)))
Which exactly shows what we expect.
Furthermore, we can also manually check its peak shape, as well as the integrated area. Within xcms there is the function featureChromatograms() which allows to visualize the EIC of an specific feature, displaying the area integrated by the algorithm:
ft_chr1 <- featureChromatograms(xdata, features = rownames(res)[1],
expandRt = 15, filled = FALSE)
ft_chr2 <- featureChromatograms(xdata, features = rownames(res)[1],
expandRt = 15, filled = TRUE)
par(mfrow = c(1, 2))
plot(ft_chr1, col = group_colors[xdata$country],
peakBg = sample_colors[chromPeaks(ft_chr1)[, "sample"]])
plot(ft_chr2, col = group_colors[xdata$country],
peakBg = sample_colors[chromPeaks(ft_chr2)[, "sample"]])
legend("topright", col = gsub("60", "", group_colors),
legend = names(group_colors), pch = 19)
Another way to check the feature of interest is using the function plotChromPeakDensity():
chr_mzr <- chromatogram(xdata, mz = res$mzmed[1] + 0.01 * c(-1, 1))
plotChromPeakDensity(chr_mzr, col = group_colors, param = pdp,
peakBg = sample_colors[chromPeaks(chr_mzr)[, "sample"]],
peakCol = sample_colors[chromPeaks(chr_mzr)[, "sample"]],
peakPch = 16)
The upper panel in the plot shows the EIC with the detected peaks highlighted, whereas the lower plot indicates which individual peaks were detected, as well as how they have been grouped within features (indicated with grey rectangles). The black line represents the density distribution of detected peaks along the retention times.
In this example we can see that there have been detected 2 features with this m/z value at different retention times. The plot tells us that our feature of interest (eluted around 800 seconds) has initially been detected only in Italian (blue) samples. Although the plot shows that this feature was not detected in QC neither in samples from Germany, with the boxplot we have seen that there is also a value for these samples. Note that this value has obtained during the application of the function fillChromPeaks(). This is clearly seen when we use the function featureChromatograms() and playing with the argument filled.